Deep Learning Math · The Lab

The Tiny Transformer Lab

a real, miniature GPT — with attention you can watch

The real architecture behind ChatGPT, shrunk until it fits in your browser: token + positional embeddings, a residual stream, stacked blocks of multi-head self-attention + a GELU MLP, LayerNorm, and a softmax head — trained live by genuine, gradient-checked backprop.

this build: {{ nLayers }} layers · {{ nHeads }} heads · {{ dDim }} dims · ctx {{ T }} · ~{{ paramCount }} weights. same shape as GPT-2, just smaller — so it babbles, but with real attention. the model lives in transformer-engine.js, a standalone file you can open and edit.

one transformer block, the way GPT builds it

Embed +Pos

Ch.4

LayerNorm

normalize

Multi-Attn

Ch.4+6

MLP+GELU

Ch.5

Unembed

Ch.6

Backprop

Ch.2

residual stream runs straight through — every block reads it, adds to it, passes it on · ×{{ nLayers }} blocks

Train the transformer

Backprop through {{ nLayers }} blocks of real attention. The loss — surprise at the true next letter — ticks down.

            step {{ step }}
            loss {{ lossDisp }}
          

learn rate {{ lrDisp }}

          {{ nLayers }} layers · {{ nHeads }} heads · {{ dDim }} dims · vocab {{ V }} · ctx {{ T }}
optimizer Adam · trained on {{ wordsN }} names · engine gradient-checked.
        

Watch attention

Each head learns its own pattern. Pick a layer and head, type a word, and watch every letter look back. Train and see them sharpen.

layer

head

row = letter looking · column = letter looked at · lower triangle only: no peeking ahead.

Generate

Feed the model its own output, using the full context each step. Train it first; fresh weights spit noise.

temp {{ tempDisp }}

The trick behind attention

“Look at the past” is just multiplying by a matrix. Step through how a plain average becomes learned attention — same shape, smarter weights.

From this to GPT-2 is only a matter of more.

Same architecture, every dial turned up — this lab already has the real pieces (multi-head, residual stream, LayerNorm, GELU).

layers

{{ nLayers }} → 12

heads

{{ nHeads }} → 12

dimensions

{{ dDim }} → 768

context

{{ T }} → 1024

vocabulary

{{ V }} → 50,257

weights

~{{ paramCount }} → 124M

Then train on the internet instead of names. The babble becomes language — but the dot products, softmax, residual stream, and gradient steps are exactly what you're running now.